Abstract



We study the recursive algorithm for computing the DCT of lengths $N = q2^m$ ($m, q \in \mathbb{N}$, $q$ odd) due to C. W. Kok [16]. We show that this algorithm has the same multiplicative complexity as that theoretically achievable by the prime factor decomposition when $m \le 2$. We also show that C. W. Kok's factorization allows a simple conversion to a scaled form. We analyze the complexity of such a scaled factorization and show that, for some lengths, it achieves lower multiplicative complexity than one of the known prime factor-based scaled transforms [14].



Index Terms 

Discrete cosine transform, DCT, scaled transform, factorization, multiplicative complexity. 

I. Introduction 

The discrete cosine transform (DCT) [1]-[3] is a fundamental and frequently used operation in modern digital signal processing. It finds applications in data compression, filter design, image recognition, etc. The scaled DCT is a modified version of this transform that allows the output to be scaled in a way that simplifies its computation [6]. Scaled DCTs are particularly popular in data compression, where the scaling of the DCT output can usually be done jointly with quantization, thereby reducing the complexity of the entire algorithm [11], [12].

Since its discovery in the early 1970s, the DCT has been the subject of extensive research, focusing, in part, on the design of fast algorithms for its computation [?], [?]. The class of DCTs of type II (DCT-II) with dyadic lengths $N = 2^m$, $m \in \mathbb{N}$, has been studied particularly well. Both theoretical complexity estimates [4]-[6] and a number of efficient algorithms for their construction have been derived [7]-[10], [13]. The construction of scaled DCT-II of dyadic lengths has also been studied [6], [11], [12]. The scaled factorizations of Y. Arai, T. Agui, and M. Nakajima [11] and of E. Feig and S. Winograd [6] are among the best-known algorithms from this class. The construction of odd-length transforms has been studied by M. Heideman [17] and by S. Chan and K. Ho [18], uncovering, in part, an elegant connection between the real-valued DFT and the DCT-II of the same length. The construction of DCTs of composite sizes, such as $N = pq$, where $p$ and $q$ are co-prime, was studied by P. Yang and M. Narasimha [19], B. G. Lee [20], and others, resulting in the development of the prime-factor decomposition of the DCT-II. E. Feig and E. Linzer further extended the prime-factor technique to the computation of scaled transforms [14]. The construction of DCTs of even (but not dyadic) lengths has been addressed by a variety of techniques, ranging from prime-factor decompositions [19], [20] to generalizations of the radix-2 DCT algorithm [16]. Reference [16] contains a comparison of several such approaches.

In this correspondence, we take another look at the recursive algorithm for computing the DCT of lengths $N = q2^m$, $m, q \in \mathbb{N}$, proposed by C. W. Kok [16]. We offer an alternative matrix formulation of this algorithm, its detailed complexity analysis, and a modification that allows computing a scaled DCT. We show that C. W. Kok's algorithm achieves the same multiplicative complexity as that theoretically attainable by the prime factor decomposition when $m \le 2$. We also show that, for some lengths, our proposed scaled version of C. W. Kok's algorithm achieves lower multiplicative complexity than one of the known scaled prime factor algorithm-based factorizations [14]. We accompany our presentation with several examples of scaled factorizations constructed by using the described algorithms, and with complexity comparison plots that may be of interest to the engineering community.

This correspondence is organized as follows. In Section II, we introduce notation and survey relevant results. In Section III, we offer a matrix formulation of C. W. Kok's algorithm and its complexity analysis. A modified (scaled) version of C. W. Kok's factorization is described in Section IV. Section V contains a comparison with prime factor-based implementations. Section VI offers remarks on the normalized multiplicative complexity of composite-length transforms. Conclusions are drawn in Section VII.
II. Notation and Some Basic Facts

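By $C^{II}_N$, $C^{III}_N$, and $C^{IV}_N$ we will denote the matrices of the $N$-point DCT-II, DCT-III, and DCT-IV transforms, respectively¹:

$$
[C^{II}_N]_{n,k} = \cos\frac{\pi(2n+1)k}{2N}, \qquad
[C^{III}_N]_{n,k} = \cos\frac{\pi n(2k+1)}{2N}, \qquad
[C^{IV}_N]_{n,k} = \cos\frac{\pi(2n+1)(2k+1)}{4N}, \qquad
n,k = 0,\ldots,N-1.
$$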



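Among these transforms, the DCT-II is the one that we will need to compute. It is well known (see, e.g., [?]) that the DCT-III is simply the inverse (or transpose) of the DCT-II (up to normalization),

$$C^{III}_N = \left(C^{II}_N\right)^{T} = \left(C^{II}_N\right)^{-1},$$

and that the DCT-IV is involutory (self-inverse, self-transpose):

$$C^{IV}_N = \left(C^{IV}_N\right)^{T} = \left(C^{IV}_N\right)^{-1}.$$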

It is also known (cf. S. C. Chan and K. L. Ho [18], C. W. Kok [16]) that the DCT-IV and DCT-II are connected as follows:

$$C^{IV}_N = R_N\, C^{II}_N\, D_N, \qquad (1)$$
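where $R_N$ is a matrix of recursive subtractions,

$$
R_N = \begin{pmatrix}
\frac{1}{2} & 0 & 0 & \cdots & 0 \\
-\frac{1}{2} & 1 & 0 & \cdots & 0 \\
\frac{1}{2} & -1 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{(-1)^{N-1}}{2} & (-1)^{N} & (-1)^{N-1} & \cdots & 1
\end{pmatrix},
$$

and $D_N$ is a diagonal matrix,

$$D_N = \mathrm{diag}\left(2\cos\frac{\pi}{4N},\; 2\cos\frac{3\pi}{4N},\; \ldots,\; 2\cos\frac{(2N-1)\pi}{4N}\right).$$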
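Relation (1) is easy to check numerically. The following is a minimal sketch in Python with NumPy under stated assumptions: normalization is omitted as above, matrix rows are indexed by frequency so that a transform is applied as $X = Cx$, $R_N$ is taken to implement $X_0 = Y_0/2$, $X_k = Y_k - X_{k-1}$, and $D_N = \mathrm{diag}\big(2\cos(\pi(2n+1)/(4N))\big)$. The helper names (`dct2_matrix`, `dct4_matrix`, `r_matrix`, `d_matrix`, `check_relation_1`) are hypothetical.

```python
import numpy as np


def dct2_matrix(n):
    """Unnormalized DCT-II; row k, column j holds cos(pi*(2j+1)*k/(2n))."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * j + 1) * k / (2 * n))


def dct4_matrix(n):
    """Unnormalized DCT-IV; row k, column j holds cos(pi*(2j+1)*(2k+1)/(4n))."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * j + 1) * (2 * k + 1) / (4 * n))


def r_matrix(n):
    """Assumed R_N: recursive subtractions X_0 = Y_0 / 2, X_k = Y_k - X_{k-1}."""
    r = np.zeros((n, n))
    for k in range(n):
        for j in range(k + 1):
            r[k, j] = (-1.0) ** (k - j)
        r[k, 0] *= 0.5  # the only non-unit coefficient (one shift) sits in column 0
    return r


def d_matrix(n):
    """Assumed D_N: diagonal pre-scaling by 2*cos(pi*(2n+1)/(4N))."""
    return np.diag(2.0 * np.cos(np.pi * (2 * np.arange(n) + 1) / (4 * n)))


def check_relation_1(n):
    """True if C^IV_N equals R_N C^II_N D_N up to floating-point error."""
    return np.allclose(dct4_matrix(n), r_matrix(n) @ dct2_matrix(n) @ d_matrix(n))


for n in (1, 2, 3, 4, 5, 8, 15):
    assert check_relation_1(n)
```

Since every coefficient of $R_N$ except the single factor $1/2$ in its first column is $\pm 1$, each application of (1) costs one shift, in line with the shift count used for C. W. Kok's algorithm.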
¹For simplicity, we omit normalization factors.
Further, it is known (cf. W. H. Chen et al. [?], Z. Wang [8]) that if the length of the DCT is even, then it can be factored into two half-length transforms:

$$C^{II}_N = P_N \begin{pmatrix} C^{II}_{N/2} & 0 \\ 0 & C^{IV}_{N/2}\, J_{N/2} \end{pmatrix} B_N, \qquad (2)$$

where $P_N$ is a permutation matrix producing the reordering

$$X'_i = X_{2i}, \quad X'_{N/2+i} = X_{2i+1}, \qquad i = 0, \ldots, N/2 - 1,$$

$B_N$ is a butterfly,

$$B_N = \begin{pmatrix} I_{N/2} & J_{N/2} \\ J_{N/2} & -I_{N/2} \end{pmatrix},$$

and $I_{N/2}$ and $J_{N/2}$ denote the $N/2 \times N/2$ identity and order-reversal matrices, respectively.
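The even-length split (2) can be checked the same way. This sketch (Python/NumPy, rows indexed by frequency, normalization omitted, hypothetical helper names) builds $P_N$, $B_N$, and $J_{N/2}$ explicitly, assuming $P_N$ places the outputs of the upper block at even positions and those of the lower block at odd positions, and compares the right-hand side of (2) with a directly evaluated DCT-II matrix.

```python
import numpy as np


def dct2_matrix(n):
    """Unnormalized DCT-II, rows indexed by frequency (X = C @ x)."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * j + 1) * k / (2 * n))


def dct4_matrix(n):
    """Unnormalized DCT-IV."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * j + 1) * (2 * k + 1) / (4 * n))


def reversal(h):
    """Order-reversal matrix J_h."""
    return np.fliplr(np.eye(h))


def butterfly(n):
    """B_N = [[I, J], [J, -I]] with N/2 x N/2 blocks."""
    h = n // 2
    i, j = np.eye(h), reversal(h)
    return np.block([[i, j], [j, -i]])


def perm(n):
    """P_N: upper-block outputs go to even positions, lower-block outputs to odd ones."""
    h = n // 2
    p = np.zeros((n, n))
    for i in range(h):
        p[2 * i, i] = 1.0
        p[2 * i + 1, h + i] = 1.0
    return p


def even_split(n):
    """Right-hand side of (2): P_N diag(C^II_{N/2}, C^IV_{N/2} J_{N/2}) B_N."""
    h = n // 2
    z = np.zeros((h, h))
    middle = np.block([[dct2_matrix(h), z], [z, dct4_matrix(h) @ reversal(h)]])
    return perm(n) @ middle @ butterfly(n)


for n in (2, 4, 6, 8, 10, 12, 16, 30):
    assert np.allclose(even_split(n), dct2_matrix(n))
```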



III. Recursive DCT-II Computation: Algorithm of C. W. Kok [16]

We note that factorization (2) is not fully recursive: it uses the DCT-II and DCT-IV of length $N/2$ as building blocks, but the subsequent factorization of the DCT-IV is not defined. One possible way of closing the recursion is to simply replace the DCT-IV with the DCT-II in accordance with (1). This way we arrive at the following factorization:

$$C^{II}_N = P_N \begin{pmatrix} C^{II}_{N/2} & 0 \\ 0 & R_{N/2}\, C^{II}_{N/2}\, D_{N/2}\, J_{N/2} \end{pmatrix} B_N. \qquad (3)$$

This factorization can be applied recursively, producing a simple algorithm for computing the DCT-II of even lengths, known as C. W. Kok's algorithm [16].
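A direct, unoptimized transcription of the recursion (3) into Python/NumPy may help make the data flow concrete. It is a sketch, not the author's implementation: the odd-length base case falls back to a plain $O(q^2)$ DCT-II where a real implementation would use an optimized $q$-point module, $R_{N/2}$ and $D_{N/2}$ are assumed to be the recursive subtractions $X_0 = Y_0/2$, $X_k = Y_k - X_{k-1}$ and the scaling by $2\cos(\pi(2n+1)/(2N))$, and the function names are hypothetical. The stages appear in the order in which (3) applies them to an input vector: butterfly $B_N$, reversal $J_{N/2}$, scaling $D_{N/2}$, half-length DCT-II, recursive subtractions $R_{N/2}$, and the output permutation $P_N$.

```python
import numpy as np


def dct2_direct(x):
    """Reference O(N^2) unnormalized DCT-II: X_k = sum_j x_j cos(pi(2j+1)k/(2N))."""
    n = len(x)
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * j + 1) * k / (2 * n)) @ x


def kok_dct2(x):
    """Recursive DCT-II following (3); odd lengths fall back to dct2_direct (assumption)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n % 2 == 1:
        return dct2_direct(x)
    h = n // 2
    tail = x[::-1][:h]                  # x_{N-1-j}, j = 0..N/2-1
    s = x[:h] + tail                    # upper butterfly output
    d = x[:h] - tail                    # J_{N/2} applied to the lower butterfly output
    # D_{N/2}: multiply by 2 cos(pi (2j+1) / (4 (N/2)))
    d = 2.0 * np.cos(np.pi * (2 * np.arange(h) + 1) / (4 * h)) * d
    y = kok_dct2(d)                     # C^II_{N/2}, applied recursively
    odd = np.empty(h)                   # R_{N/2}: recursive subtractions
    odd[0] = 0.5 * y[0]
    for k in range(1, h):
        odd[k] = y[k] - odd[k - 1]
    out = np.empty(n)
    out[0::2] = kok_dct2(s)             # P_N: even outputs from the DCT-II half
    out[1::2] = odd                     # P_N: odd outputs from the DCT-IV half
    return out


rng = np.random.default_rng(0)
for n in (2, 3, 4, 6, 8, 12, 16, 24, 30, 40, 48):
    x = rng.standard_normal(n)
    assert np.allclose(kok_dct2(x), dct2_direct(x))
```

Each recursion level does $O(N)$ extra work and, in this sketch, each of the $2^m$ odd-length leaves costs $O(q^2)$, so the whole transform takes roughly $O(N(m + q))$ operations for $N = q2^m$.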

We show the flowgraph of this algorithm in Fig. 1. As is customary, dashed lines in the flowgraph denote sign inversions, circles indicate additions, and constants above lines indicate multiplications by the corresponding factors.

Based on Fig. 1, it can be observed that the numbers of multiplications $\mu(N)$, additions and subtractions $\alpha(N)$, and shifts (multiplications by dyadic factors) $\sigma(N)$ satisfy

$$
\begin{aligned}
\mu(N) &= 2\mu(N/2) + \tfrac{1}{2}N, \\
\alpha(N) &= 2\alpha(N/2) + \tfrac{3}{2}N - 1, \\
\sigma(N) &= 2\sigma(N/2) + 1.
\end{aligned}
$$

By applying this decomposition recursively $m$ times, we arrive at the following result (cf. [16]).
Proposition 1 (C. W. Kok, 1997). The numbers of arithmetic operations $(\mu, \alpha, \sigma)$ needed for computing the DCT-II of length $N = q2^m$ using C. W. Kok's algorithm satisfy:

$$
\begin{aligned}
\mu(N) &= 2^m \mu(q) + \tfrac{m}{2}N, \qquad (4)\\
\alpha(N) &= 2^m \alpha(q) + \tfrac{3m}{2}N - 2^m + 1, \\
\sigma(N) &= 2^m \sigma(q) + 2^m - 1.
\end{aligned}
$$
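The closed forms of Proposition 1 can be cross-checked against the per-stage recurrences with a few lines of Python. The sketch below takes the odd-length module costs $(\mu(q), \alpha(q), \sigma(q))$ from Table I and treats $q = 1$ as the trivial 1-point transform with zero cost (an assumption of the sketch); the function names are hypothetical.

```python
def kok_counts_recursive(q, m, base):
    """Iterate mu(N)=2mu(N/2)+N/2, alpha(N)=2alpha(N/2)+3N/2-1, sigma(N)=2sigma(N/2)+1."""
    mu, al, si = base          # (mu(q), alpha(q), sigma(q)) of the odd-length module
    n = q
    for _ in range(m):
        n *= 2
        mu, al, si = 2 * mu + n // 2, 2 * al + 3 * n // 2 - 1, 2 * si + 1
    return mu, al, si


def kok_counts_closed(q, m, base):
    """Closed form (4) of Proposition 1 (all divisions are exact for even N)."""
    mu, al, si = base
    p, n = 2 ** m, q * 2 ** m
    return p * mu + m * n // 2, p * al + 3 * m * n // 2 - p + 1, p * si + p - 1


# Odd-length module costs (mu, alpha, sigma) from Table I; q = 1 is the trivial 1-point DCT.
MODULES = {1: (0, 0, 0), 3: (1, 4, 1), 5: (4, 13, 1), 15: (14, 70, 4)}

for q, base in MODULES.items():
    for m in range(8):
        assert kok_counts_recursive(q, m, base) == kok_counts_closed(q, m, base)
```

For instance, `kok_counts_closed(1, 3, (0, 0, 0))` returns `(12, 29, 7)`, the cost of C. W. Kok's algorithm for $N = 8$.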
IV. Proposed Alternative (Scaled) Factorization

Consider the DCT-II factorization (2) once more. Since the DCT-IV is involutory, we can compute it in a transposed fashion, producing (cf. (1)):

$$C^{IV}_N = \left(R_N C^{II}_N D_N\right)^{T} = D_N\, C^{III}_N\, R_N^{T}. \qquad (5)$$

By plugging this expression into (2), we arrive at the following alternative decimation scheme:

$$C^{II}_N = P_N \begin{pmatrix} C^{II}_{N/2} & 0 \\ 0 & D_{N/2}\, C^{III}_{N/2}\, R^{T}_{N/2}\, J_{N/2} \end{pmatrix} B_N. \qquad (6)$$

Since only the order of operations has changed, the complexity of this decimation scheme and that of the scheme used in C. W. Kok's algorithm (3) must be exactly the same. At the same time, as shown in Fig. 2, this modified factorization moves all the factors associated with the matrix $D_{N/2}$ to the last stage. This means that, if it is sufficient to compute a scaled version of the transform, these multiplications can be avoided. The proposed factorization is, therefore, well suited for the implementation of scaled transforms.
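The transposed form (5) and the alternative decimation scheme (6) can be verified numerically as well. The sketch below uses Python/NumPy with rows indexed by frequency, takes $C^{III}_N$ as $(C^{II}_N)^T$ since normalization is omitted, and assumes $R_N$ implements $X_0 = Y_0/2$, $X_k = Y_k - X_{k-1}$ and $D_N = \mathrm{diag}\big(2\cos(\pi(2n+1)/(4N))\big)$; the helper names are hypothetical.

```python
import numpy as np


def dct2_matrix(n):
    """Unnormalized DCT-II, rows indexed by frequency (X = C @ x)."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * j + 1) * k / (2 * n))


def dct4_matrix(n):
    """Unnormalized DCT-IV."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * j + 1) * (2 * k + 1) / (4 * n))


def r_matrix(n):
    """Assumed R_N: X_0 = Y_0 / 2, X_k = Y_k - X_{k-1} (lower triangular, +/-1 entries)."""
    r = np.tril((-1.0) ** np.subtract.outer(np.arange(n), np.arange(n)))
    r[:, 0] *= 0.5
    return r


def d_matrix(n):
    """Assumed D_N = diag(2 cos(pi (2n+1) / (4N)))."""
    return np.diag(2.0 * np.cos(np.pi * (2 * np.arange(n) + 1) / (4 * n)))


def reversal(h):
    return np.fliplr(np.eye(h))


def butterfly(n):
    h = n // 2
    return np.block([[np.eye(h), reversal(h)], [reversal(h), -np.eye(h)]])


def perm(n):
    """P_N: upper-block outputs to even positions, lower-block outputs to odd ones."""
    h = n // 2
    p = np.zeros((n, n))
    p[0::2, :h] = np.eye(h)
    p[1::2, h:] = np.eye(h)
    return p


def scheme_6(n):
    """Right-hand side of (6): the DCT-IV half is computed as D C^III R^T J."""
    h = n // 2
    z = np.zeros((h, h))
    dct3 = dct2_matrix(h).T                     # C^III = (C^II)^T, normalization omitted
    lower = d_matrix(h) @ dct3 @ r_matrix(h).T @ reversal(h)
    return perm(n) @ np.block([[dct2_matrix(h), z], [z, lower]]) @ butterfly(n)


for n in (2, 4, 6, 8, 10, 12, 16, 24, 30):
    h = n // 2
    # (5): C^IV = D C^III R^T, with the diagonal D now applied last
    assert np.allclose(dct4_matrix(h), d_matrix(h) @ dct2_matrix(h).T @ r_matrix(h).T)
    # (6) reproduces the full N-point DCT-II
    assert np.allclose(scheme_6(n), dct2_matrix(n))
```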

Hereafter, we will say that a DCT factorization is scaled if it can be presented as

$$C^{II}_N = \Pi_N\, \Delta_N\, C^{*}_N, \qquad (7)$$

where $\Pi_N$ is a reordering matrix, $\Delta_N$ is a diagonal matrix of scale factors, and $C^{*}_N$ is the matrix of the scaled transform.

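By using this representation, we can rewrite (6) as

$$C^{II}_N = P_N \begin{pmatrix} \Pi_{N/2}\,\Delta_{N/2}\,C^{*}_{N/2} & 0 \\ 0 & D_{N/2}\, C^{III}_{N/2}\, R^{T}_{N/2}\, J_{N/2} \end{pmatrix} B_N,$$

implying that the scaled part of the transform can be computed recursively as follows:

$$C^{*}_N = \begin{pmatrix} C^{*}_{N/2} & 0 \\ 0 & C^{III}_{N/2}\, R^{T}_{N/2}\, J_{N/2} \end{pmatrix} B_N. \qquad (8)$$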

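The associated reordering and scaling matrices can also be computed recursively by using

$$\Pi_N = P_N \begin{pmatrix} \Pi_{N/2} & 0 \\ 0 & I_{N/2} \end{pmatrix}, \qquad \Delta_N = \begin{pmatrix} \Delta_{N/2} & 0 \\ 0 & D_{N/2} \end{pmatrix}. \qquad (9)$$

In order to compute the remaining DCT-III block in (8), we can either pick some existing (non-scaled) factorization or reuse our scaled design (8), (9) followed by conversion to the full (non-scaled) transform.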
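A sketch of the scaled construction, assuming the recursions $\Pi_N = P_N\,\mathrm{diag}(\Pi_{N/2}, I_{N/2})$ and $\Delta_N = \mathrm{diag}(\Delta_{N/2}, D_{N/2})$ together with (8), is given below in Python/NumPy. The odd-length base case is deliberately left unscaled ($\Pi_q = \Delta_q = I_q$, $C^{*}_q = C^{II}_q$), which is an assumption of this sketch rather than a choice made in the text; $R_N$, $D_N$, and the helper names are the same hypothetical ones as in a plain numerical check of (1). The check confirms that $\Delta_N$ stays diagonal, $\Pi_N$ stays a permutation, and $\Pi_N \Delta_N C^{*}_N$ reproduces $C^{II}_N$.

```python
import numpy as np


def dct2_matrix(n):
    """Unnormalized DCT-II, rows indexed by frequency."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * j + 1) * k / (2 * n))


def r_matrix(n):
    """Assumed R_N: X_0 = Y_0 / 2, X_k = Y_k - X_{k-1}."""
    r = np.tril((-1.0) ** np.subtract.outer(np.arange(n), np.arange(n)))
    r[:, 0] *= 0.5
    return r


def d_matrix(n):
    """Assumed D_N = diag(2 cos(pi (2n+1) / (4N)))."""
    return np.diag(2.0 * np.cos(np.pi * (2 * np.arange(n) + 1) / (4 * n)))


def reversal(h):
    return np.fliplr(np.eye(h))


def butterfly(n):
    h = n // 2
    return np.block([[np.eye(h), reversal(h)], [reversal(h), -np.eye(h)]])


def perm(n):
    h = n // 2
    p = np.zeros((n, n))
    p[0::2, :h] = np.eye(h)
    p[1::2, h:] = np.eye(h)
    return p


def scaled_factors(n):
    """Return (Pi_N, Delta_N, Cstar_N) with C^II_N = Pi_N @ Delta_N @ Cstar_N."""
    if n % 2 == 1:
        # Assumed base case: the odd-length module is left unscaled.
        return np.eye(n), np.eye(n), dct2_matrix(n)
    h = n // 2
    pi_h, delta_h, cstar_h = scaled_factors(h)
    z = np.zeros((h, h))
    lower = dct2_matrix(h).T @ r_matrix(h).T @ reversal(h)   # C^III R^T J with D moved out
    cstar = np.block([[cstar_h, z], [z, lower]]) @ butterfly(n)
    pi = perm(n) @ np.block([[pi_h, z], [z, np.eye(h)]])
    delta = np.block([[delta_h, z], [z, d_matrix(h)]])
    return pi, delta, cstar


for n in (2, 4, 6, 8, 12, 16, 20, 24, 40, 48):
    pi, delta, cstar = scaled_factors(n)
    assert np.allclose(delta, np.diag(np.diag(delta)))       # Delta_N is diagonal
    assert np.allclose(pi @ pi.T, np.eye(n))                 # Pi_N is a permutation
    assert np.allclose(pi @ delta @ cstar, dct2_matrix(n))   # C^II_N = Pi_N Delta_N C*_N
```

In this form all multiplications by the entries of $D_{N/2}$ end up in $\Delta_N$, so a caller that only needs $C^{*}_N x$ never performs them.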
A. Complexity Analysis

As already noted, the complexity of computing the DCT-II by using our factorization (6) is identical to that of C. W. Kok's algorithm (3). However, when only the scaled transform $C^{*}_N$ needs to be computed, some operations can be saved. Based on Fig. 2, we can establish the following relations:

$$
\begin{aligned}
\tilde\mu(N) &= \tilde\mu(N/2) + \mu(N/2), \\
\tilde\alpha(N) &= \tilde\alpha(N/2) + \alpha(N/2) + \tfrac{3}{2}N - 1, \qquad (10)\\
\tilde\sigma(N) &= \tilde\sigma(N/2) + \sigma(N/2) + 1,
\end{aligned}
$$

where $\tilde\mu$, $\tilde\alpha$, and $\tilde\sigma$ denote the numbers of multiplications, additions, and shift operations, respectively, needed for computing the scaled transforms $C^{*}_N$, and where $\mu$, $\alpha$, and $\sigma$ represent the numbers of operations needed for computing the lower non-scaled blocks $C^{III}_{N/2}$. By applying this decomposition recursively $m$ times, we arrive at the following result.

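Proposition 2. The numbers of arithmetic operations $(\tilde\mu, \tilde\alpha, \tilde\sigma)$ needed for computing the scaled DCT-II of length $N = q2^m$ using factorization (8) satisfy:

$$
\begin{aligned}
\tilde\mu(N) &= \tilde\mu(q) + (2^m - 1)\mu(q) + \left(\tfrac{m}{2} - 1 + 2^{-m}\right)N, \qquad (11)\\
\tilde\alpha(N) &= \tilde\alpha(q) + (2^m - 1)\alpha(q) + \tfrac{3m}{2}N - 2^m + 1, \\
\tilde\sigma(N) &= \tilde\sigma(q) + (2^m - 1)\sigma(q) + 2^m - 1.
\end{aligned}
$$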

By comparing (11) with the number of multiplications required in C. W. Kok's algorithm (4), we can conclude that the use of our proposed scaled factorization saves at least

$$\mu(N) - \tilde\mu(N) = \mu(q) - \tilde\mu(q) + \left(1 - 2^{-m}\right)N \;\ge\; \left(1 - 2^{-m}\right)N$$

multiplications (the inequality holds because $\mu(q) \ge \tilde\mu(q)$). When the number of iterations $m$ is large, it can be further observed that

$$\frac{\mu(N) - \tilde\mu(N)}{N} \to 1 \qquad (m \to \infty),$$

approaching the well-known upper bound for the multiplicative complexity reduction realizable by scaled transforms [6].
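The counts in Proposition 2 and the savings bound can be cross-checked with exact integer arithmetic. In this sketch the lower non-scaled block of length $q2^k$ is assumed to be computed by C. W. Kok's algorithm (so its cost is given by (4)), the module costs come from Table I, and the blank scaled-multiplication entry for $q = 3$ in Table I is read as 0; all function names are hypothetical.

```python
def kok_counts(q, m, base):
    """Closed form (4): cost of C. W. Kok's algorithm for N = q * 2^m."""
    mu, al, si = base
    p, n = 2 ** m, q * 2 ** m
    return p * mu + m * n // 2, p * al + 3 * m * n // 2 - p + 1, p * si + p - 1


def scaled_counts_recursive(q, m, base, sbase):
    """Iterate the per-stage relations; the lower block of length q*2^k costs kok_counts(q, k)."""
    mu_s, al_s, si_s = sbase
    n = q
    for k in range(m):
        mu, al, si = kok_counts(q, k, base)
        n *= 2
        mu_s, al_s, si_s = mu_s + mu, al_s + al + 3 * n // 2 - 1, si_s + si + 1
    return mu_s, al_s, si_s


def scaled_counts_closed(q, m, base, sbase):
    """Closed form (11); the term (m/2 - 1 + 2^-m) N is written as m*N/2 - N + q."""
    mu, al, si = base
    mu_s, al_s, si_s = sbase
    p, n = 2 ** m, q * 2 ** m
    return (mu_s + (p - 1) * mu + m * n // 2 - n + q,
            al_s + (p - 1) * al + 3 * m * n // 2 - p + 1,
            si_s + (p - 1) * si + p - 1)


# (non-scaled, scaled) module costs (mu, alpha, sigma) from Table I.
# q = 1 is the trivial 1-point transform; the blank scaled-mu entry for q = 3 is read as 0.
MODULES = {
    1: ((0, 0, 0), (0, 0, 0)),
    3: ((1, 4, 1), (0, 4, 1)),
    5: ((4, 13, 1), (2, 13, 1)),
    15: ((14, 70, 4), (10, 67, 8)),
}

for q, (base, sbase) in MODULES.items():
    for m in range(8):
        n = q * 2 ** m
        scaled = scaled_counts_closed(q, m, base, sbase)
        assert scaled == scaled_counts_recursive(q, m, base, sbase)
        saved = kok_counts(q, m, base)[0] - scaled[0]
        assert saved == base[0] - sbase[0] + n - q   # mu(q) - mu~(q) + (1 - 2^-m) N
        assert saved >= n - q                        # i.e. at least (1 - 2^-m) N
```

For example, `scaled_counts_closed(1, 3, (0, 0, 0), (0, 0, 0))` gives `(5, 29, 7)`: the multiplication and addition counts of the 8-point example in Section IV-B, before the 2-point rescaling described there removes the shifts.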

B. Construction Examples 

We note that, in many practical situations, the multiplications by factors 1/2 in our scheme can be avoided. Below, we provide two examples showing how this can be accomplished.
1) Scaled DCT-II of lengths $N = 2^m$: We scale the matrix of the 2-point DCT-II as follows

This moves factors into the DC paths, allowing them to be subsequently merged with the factors 1/2 in our algorithm. We show the resulting flowgraphs in Fig. 3.

Simple calculations show that the numbers of operations in such scaled factorizations satisfy

$$
\begin{aligned}
\tilde\mu(2^m) &= m\,2^{m-1} - 2^m + 1, \\
\tilde\alpha(2^m) &= 3m\,2^{m-1} - 2^m + 1, \\
\tilde\sigma(2^m) &= 0.
\end{aligned}
$$

For example, when $N = 8$ (the largest size shown in Fig. 3), our algorithm produces a factorization with just 5 multiplications and 29 additions. This matches the performance of the well-known scaled DCT factorization of Y. Arai, T. Agui, and M. Nakajima [11].
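The dyadic-length counts can be tabulated directly. The sketch below evaluates the formulas for $\tilde\mu(2^m)$, $\tilde\alpha(2^m)$, $\tilde\sigma(2^m)$ given in this example and, for comparison, the Feig-Winograd count $\tilde\mu(2^m) = 2^{m+1} - m(m+3)/2 - 2$ quoted in the proof of Proposition 3; the function names are hypothetical.

```python
def scaled_dyadic_counts(m):
    """(mu~, alpha~, sigma~) of the proposed scaled 2^m-point factorization, m >= 1."""
    n = 2 ** m
    return m * n // 2 - n + 1, 3 * m * n // 2 - n + 1, 0


def feig_winograd_mu(m):
    """Scaled 2^m-point multiplications quoted in the proof of Proposition 3."""
    return 2 ** (m + 1) - m * (m + 3) // 2 - 2


for m in range(1, 7):
    mu, alpha, sigma = scaled_dyadic_counts(m)
    print(f"N = {2 ** m:3d}: mu~ = {mu:3d} (Feig-Winograd {feig_winograd_mu(m):3d}), "
          f"alpha~ = {alpha:4d}, sigma~ = {sigma}")
```

The two multiplication counts coincide for $m \le 3$ (0, 1, and 5 multiplications for $N = 2, 4, 8$); for $m = 4$ the formulas of this example give 17 multiplications versus 16.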

2) Scaled DCT-II of lengths $N = 3 \cdot 2^m$: We scale the matrix of the 3-point DCT-II,

$$C^{II}_3 = \begin{pmatrix} 1 & 1 & 1 \\ \cos\frac{\pi}{6} & 0 & -\cos\frac{\pi}{6} \\ \frac{1}{2} & -1 & \frac{1}{2} \end{pmatrix},$$

as follows
This brings a factor of 2 into the DC path, leading to the cancellation of the factors 1/2 in our algorithm. The resulting flowgraphs are shown in Fig. 4.

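It can be readily verified that the numbers of operations in such scaled factorizations are

$$
\begin{aligned}
\tilde\mu(3\cdot 2^m) &= 3m\,2^{m-1} - 2^{m+1} + 2, \\
\tilde\alpha(3\cdot 2^m) &= 9m\,2^{m-1} + 3\cdot 2^m + 1, \\
\tilde\sigma(3\cdot 2^m) &= 2^m.
\end{aligned}
$$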

For example, the scaled transform of length $N = 6$ shown in Fig. 4 uses only 1 multiplication, 16 additions, and 2 shifts.
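The counts for $N = 3\cdot 2^m$ can be compared with the general Proposition 2 formulas evaluated with the $q = 3$ module from Table I (reading the blank $\tilde\mu(3)$ entry as 0, an assumption). The sketch below (hypothetical function names) shows that the rescaled 3-point module leaves $\tilde\mu$ and $\tilde\alpha$ unchanged and only reduces the shift count from $2^{m+1} - 1$ to $2^m$.

```python
def scaled_3x2m_counts(m):
    """Counts for N = 3*2^m with the rescaled 3-point module (formulas in the text)."""
    p = 2 ** m
    return 3 * m * p // 2 - 2 * p + 2, 9 * m * p // 2 + 3 * p + 1, p


def proposition2_q3(m):
    """(11) with Table I costs for q = 3: mu=1, alpha=4, sigma=1; mu~=0 (assumed), alpha~=4, sigma~=1."""
    p, n = 2 ** m, 3 * 2 ** m
    return ((p - 1) * 1 + m * n // 2 - n + 3,
            4 + (p - 1) * 4 + 3 * m * n // 2 - p + 1,
            1 + (p - 1) * 1 + p - 1)


for m in range(0, 9):
    special, general = scaled_3x2m_counts(m), proposition2_q3(m)
    assert special[:2] == general[:2]             # same multiplications and additions
    assert general[2] - special[2] == 2 ** m - 1  # shifts saved by the rescaled module
```

With $m = 1$ this reproduces the $N = 6$ figures quoted above: `scaled_3x2m_counts(1)` returns `(1, 16, 2)`.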



V. Comparison with the Prime Factor Algorithm-based Implementations 

It is known that the DCT-II of length $N = pq$, where $p$ and $q$ are relatively prime, can be computed as a cascade of $p$ transforms of length $q$ followed by $q$ transforms of length $p$ [2], [19], [20]. Such a decomposition is commonly called a prime factor algorithm (PFA). When one of the factors, for example $p$, is dyadic, we arrive at lengths $N = q2^m$, implying that the PFA is an alternative technique for computing such transforms. Hence, we are interested in comparing the PFA with C. W. Kok's algorithm.

We report the following result. 

Theorem 1. The multiplicative complexity of the DCT-II of length $N = q2^m$ constructed by using C. W. Kok's algorithm matches that theoretically achievable by using the prime-factor DCT-II factorization iff $m \le 2$.

Proof: Based on the PFA structure, the number of multiplications needed to implement a transform of length $N = q2^m$ satisfies (cf. [?], [19]) $\mu(N) = 2^m\mu(q) + q\,\mu(2^m)$. Furthermore, from complexity studies of dyadic-length transforms [4]-[6], we know that $\mu(2^m) \ge 2^{m+1} - m - 2$. Combining these formulae, we obtain

$$\mu(N) \ge 2^m\mu(q) + q\left(2^{m+1} - m - 2\right).$$



December 29, 2009 



DRAFT 






TABLE I
Component Short-Length DCT-II Modules [?]

           DCT                 Scaled DCT
  N     μ     α     σ        μ̃     α̃     σ̃
  3     1     4     1        -     4     1
  5     4    13     1        2    13     1
 15    14    70     4       10    67     8
  2     1     2     -        -     2     -
  4     4     9     -        1     9     -
  8    11    29     -        5    29     -
 16    26    81     -       16    81     -



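By comparing this result with the complexity estimate for C. W. Kok's algorithm (4),

$$\mu(N) = 2^m\mu(q) + q\,m\,2^{m-1},$$

and noting that $m\,2^{m-1} = 2^{m+1} - m - 2$ for $m = 1, 2$, whereas $m\,2^{m-1} > 2^{m+1} - m - 2$ for all $m \ge 3$, we arrive at the statement of the theorem. ■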
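The comparison behind Theorem 1 reduces to the term multiplying $q$: C. W. Kok's algorithm spends $q\,m\,2^{m-1}$ multiplications on top of $2^m\mu(q)$, while the PFA bound spends $q(2^{m+1} - m - 2)$. A short Python check (hypothetical function names) confirms that the two agree exactly for $m \le 2$ and that Kok's term is larger for every $m \ge 3$ in the tested range.

```python
def kok_extra(q, m):
    """Term of (4) beyond 2^m mu(q): (m/2) N = q m 2^(m-1) multiplications."""
    return q * m * 2 ** m // 2


def pfa_extra_bound(q, m):
    """PFA lower bound on the same term: q (2^(m+1) - m - 2)."""
    return q * (2 ** (m + 1) - m - 2)


for q in (1, 3, 5, 7, 15):
    for m in range(1, 16):
        kok, pfa = kok_extra(q, m), pfa_extra_bound(q, m)
        assert kok >= pfa                     # the bound is never beaten
        assert (kok == pfa) == (m <= 2)       # equality exactly when m <= 2
```

For $q = 3$ and $m = 3$, for instance, the two terms are 36 and 33, a gap of 3 multiplications.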
We now turn our attention to the complexity comparison of scaled transforms.

Proposition 3. The multiplicative complexity of the PFA-based scaled DCT-II of length $N = q2^m$ satisfies:

$$\tilde\mu(N) \le 2^m\tilde\mu(q) + \frac{5}{2}N - q\,\frac{m(m+3)+5}{2} - 2^{m-1} + \frac{1}{2}. \qquad (12)$$

Proof: We use the scaled PFA construction of Feig and Linzer [14], which yields $\tilde\mu(N) \le 2^m\tilde\mu(q) + q\,\tilde\mu(2^m) + \frac{1}{2}\left(N - 2^m - q + 1\right)$. We then apply the Feig-Winograd algorithm for computing the scaled DCT of dyadic lengths [6], for which $\tilde\mu(2^m) = 2^{m+1} - \frac{m(m+3)}{2} - 2$. ■
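Formula (12) follows from the two ingredients quoted in the proof. The sketch below uses exact rational arithmetic (`fractions.Fraction`) to confirm that the expression in the proof and the closed form (12) agree, taking $\tilde\mu(q)$ from Table I (with the blank $\tilde\mu(3)$ read as 0); it assumes the constant term of (12) is $+\tfrac{1}{2}$, which is what the proof's expression produces. Function names are hypothetical.

```python
from fractions import Fraction


def feig_winograd_mu(m):
    """Scaled 2^m-point DCT multiplications: 2^(m+1) - m(m+3)/2 - 2 (m(m+3) is always even)."""
    return 2 ** (m + 1) - m * (m + 3) // 2 - 2


def fl_mu_from_proof(q, m, mu_s_q):
    """2^m mu~(q) + q mu~(2^m) + (N - 2^m - q + 1)/2, as stated in the proof."""
    n = q * 2 ** m
    return 2 ** m * mu_s_q + q * feig_winograd_mu(m) + Fraction(n - 2 ** m - q + 1, 2)


def fl_mu_closed(q, m, mu_s_q):
    """Closed form (12), with constant term +1/2."""
    n = q * 2 ** m
    return (2 ** m * mu_s_q + Fraction(5 * n, 2) - Fraction(q * (m * (m + 3) + 5), 2)
            - Fraction(2 ** m, 2) + Fraction(1, 2))


# mu~(q) from Table I; the blank entry for q = 3 is read as 0 (assumption).
SCALED_MU = {3: 0, 5: 2, 15: 10}

for q, mu_s in SCALED_MU.items():
    for m in range(1, 9):
        value = fl_mu_closed(q, m, mu_s)
        assert value == fl_mu_from_proof(q, m, mu_s)
        assert value.denominator == 1          # odd q makes the count an integer
```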
We note that, in order to compare the obtained expression (12) with the one corresponding to our scaled version of C. W. Kok's algorithm (11),

$$\tilde\mu(N) = \tilde\mu(q) + (2^m - 1)\mu(q) + \left(\frac{m}{2} - 1 + 2^{-m}\right)N,$$

we need to know the complexities of both the scaled and non-scaled transforms of length $q$. For this purpose, we use several short-length DCT-II modules with the complexity numbers shown in Table I. Such odd-length transforms can be found in [17] ($N = 3$), [21] ($N = 5$), and [?], [22] ($N = 15$). The listed complexity numbers for dyadic-length transforms are from [?].

In Table II, we provide a comparison of the resulting transforms of composite lengths. Bold font is used to highlight the best complexity numbers. It can be observed that, for $q = 3, 5$, our proposed algorithm shows
TABLE II
Complexity of Scaled DCT-II Factorizations of Lengths N = q2^m

                      Proposed               Feig and Linzer [14]
  q   m     N      μ̃      α̃      σ̃         μ̃      α̃      σ̃
  3   1     6      1      16      2          ?       ?       ?
  3   2    12      ?      49      ?          ?      49       ?
  3   3    24     22       ?      ?         22     133       ?
  3   4    48     66     337     16         63     337      16
  5   1    10      ?      40      ?          ?      40       ?
  5   2    20     19     109      7         19     109       4
  5   3    40     55     277     15         55     277       8
  5   4    80    147     673     31        142     673      16
 15   1    30     24     181     13         27     178      16
 15   2    60     67     454     23         76     445      32
 15   3   120    183    1090     43        204    1069      64
 15   4   240    475    2542     83        505    2497     128



identical multiplicative complexity to the Feig-Linzer scaled PFA implementations when $m \le 3$, and becomes more complex for higher $m$. For $q = 15$ and $m \le 4$, our proposed algorithm is more efficient (in the multiplicative complexity sense) than the scaled PFA implementations.
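The multiplication columns of Table II can be regenerated from (11) and (12). The sketch below (Python, exact arithmetic, Table I module costs with the blank $\tilde\mu(3)$ read as 0, hypothetical function names) returns $(\tilde\mu_{\text{proposed}}, \tilde\mu_{\text{PFA}})$ pairs for $m = 1, \ldots, 4$.

```python
from fractions import Fraction

# q: (mu(q), mu~(q)) from Table I; the blank mu~(3) entry is read as 0 (assumption).
MODULES = {3: (1, 0), 5: (4, 2), 15: (14, 10)}


def proposed_mu(q, m):
    """Closed form (11) for the scaled version of C. W. Kok's algorithm."""
    mu, mu_s = MODULES[q]
    n = q * 2 ** m
    return mu_s + (2 ** m - 1) * mu + m * n // 2 - n + q


def feig_linzer_mu(q, m):
    """Closed form (12) for the scaled PFA of Feig and Linzer."""
    _, mu_s = MODULES[q]
    n = q * 2 ** m
    value = (2 ** m * mu_s + Fraction(5 * n, 2) - Fraction(q * (m * (m + 3) + 5), 2)
             - Fraction(2 ** m, 2) + Fraction(1, 2))
    assert value.denominator == 1
    return int(value)


def table2_mu_columns(q, ms=range(1, 5)):
    """(proposed, Feig-Linzer) multiplication counts for N = q * 2^m."""
    return [(proposed_mu(q, m), feig_linzer_mu(q, m)) for m in ms]
```

For $q = 15$ it returns (24, 27), (67, 76), (183, 204), and (475, 505), so the proposed factorization needs fewer multiplications for every listed $m$, while for $q = 3$ and $q = 5$ the two counts coincide up to $m = 3$.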

VI. On Normalized Multiplicative Complexity of Scaled Transforms 

We complement our presentation by providing plots of the normalized multiplicative complexity $\tilde\mu(N)/N$ of scaled DCTs of lengths $N = 2^m$, $3\cdot 2^m$, $5\cdot 2^m$, and $15\cdot 2^m$. We present these plots in Fig. 5. It can be observed that, among short-length transforms ($N \le 128$), scaled dyadic-length transforms are more complex than transforms with the nearest composite lengths from the sequences $N = 3\cdot 2^m$ or $N = 15\cdot 2^m$. We believe that the use of such composite-length transforms can offer appreciable complexity savings in many practical applications.
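The points behind Fig. 5 can be approximated by evaluating $\tilde\mu(N)/N$ from the formulas used above. This sketch is an assumption-laden reconstruction, not the script used for the figure: dyadic lengths use the Feig-Winograd count, composite lengths take the smaller of (11) and (12), and module costs come from Table I with the blank $\tilde\mu(3)$ read as 0. Function names are hypothetical.

```python
from fractions import Fraction

# q: (mu(q), mu~(q)) from Table I; the blank mu~(3) entry is read as 0 (assumption).
MODULES = {3: (1, 0), 5: (4, 2), 15: (14, 10)}


def feig_winograd_mu(m):
    """Scaled 2^m-point count quoted in the proof of Proposition 3."""
    return 2 ** (m + 1) - m * (m + 3) // 2 - 2


def proposed_mu(q, m):
    """Closed form (11)."""
    mu, mu_s = MODULES[q]
    n = q * 2 ** m
    return mu_s + (2 ** m - 1) * mu + m * n // 2 - n + q


def feig_linzer_mu(q, m):
    """Closed form (12)."""
    _, mu_s = MODULES[q]
    n = q * 2 ** m
    value = (2 ** m * mu_s + Fraction(5 * n, 2) - Fraction(q * (m * (m + 3) + 5), 2)
             - Fraction(2 ** m, 2) + Fraction(1, 2))
    assert value.denominator == 1
    return int(value)


def odd_part(n):
    """Split n into (q, m) with n = q * 2^m and q odd."""
    m = 0
    while n % 2 == 0:
        n //= 2
        m += 1
    return n, m


def best_scaled_mu(n):
    """Smallest available multiplication count for length n (assumed selection rule)."""
    q, m = odd_part(n)
    if q == 1:
        return feig_winograd_mu(m)
    if q not in MODULES:
        raise ValueError(f"no module cost available for q = {q}")
    return min(proposed_mu(q, m), feig_linzer_mu(q, m))


def normalized_curve(max_len=128):
    """Map N -> mu~(N)/N for N in {2^m, 3*2^m, 5*2^m, 15*2^m}, m >= 1, N <= max_len."""
    lengths = sorted({q * 2 ** m for q in (1, 3, 5, 15) for m in range(1, 8)
                      if q * 2 ** m <= max_len})
    return {n: Fraction(best_scaled_mu(n), n) for n in lengths}
```

Under these assumptions, $N = 128$ gives $219/128 \approx 1.71$ multiplications per point, whereas $N = 120$ gives $183/120 \approx 1.53$.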

VII. Conclusions 

An alternative derivation and a detailed complexity analysis of C. W. Kok's algorithm for computing the DCT of lengths $N = q2^m$ ($m, q \in \mathbb{N}$, $q$ odd) are offered. It is shown that this algorithm has the same multiplicative complexity as that theoretically achievable by the prime factor decomposition when $m \le 2$. Additionally, a scaled DCT factorization based on C. W. Kok's algorithm is proposed. It is shown that, for some lengths, this scaled factorization achieves lower multiplicative complexity than one of the known prime factor-based scaled transforms.

Fig. 5. Normalized multiplicative complexity $\tilde\mu(N)/N$ of scaled DCT factorizations of lengths $N = 2^m$, $3\cdot 2^m$, $5\cdot 2^m$, $15\cdot 2^m$. Legend: $N = 2^m$: Feig-Winograd; $N = 3\cdot 2^m$: Feig-Linzer/Proposed ($m \le 3$); $N = 5\cdot 2^m$: Feig-Linzer/Proposed ($m \le 3$); $N = 15\cdot 2^m$: Proposed algorithm.